World Cup Goalie Analysis

In this IPython notebook I take a Bayesian approach to analyzing just how good Tim Howard's record was during the World Cup compared to other goalkeepers. Specifically, I use an empirical Bayes approach.

The actual Bayesian analysis is fairly straightforward, but getting the data is not. As a result, most of this notebook deals with scraping and preparing the data. Feel free to skip ahead to the analysis if that doesn't interest you.

Data Preparation

Unless you happen to be interested in how to scrape data, you should probably skip this section.


In [1]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [216]:
% ls data


2014 FIFA World Cup Brazil™  Brazil-Croatia - Statistics - FIFA.com.html
2014 FIFA World Cup Brazil™ - Statistics - Matches - Top attacks - FIFA.com_files/
2014 FIFA World Cup Brazil™ - Statistics - Matches - Top attacks - FIFA.com.html
2014 FIFA World Cup Brazil™ - Statistics - Matches - Top defending - FIFA.com_files/
2014 FIFA World Cup Brazil™ - Statistics - Matches - Top defending - FIFA.com.html
2014 FIFA World Cup Brazil™ - Statistics - Matches - Top goals - FIFA.com_files/
2014 FIFA World Cup Brazil™ - Statistics - Matches - Top goals - FIFA.com.html
all_stats.json
attacking.csv
defending.csv
goals.csv
shots.csv
usa.json

In [3]:
# Grab data! First focus on World Cup data from FIFA, in tabular form

import pandas as pd

goals = pd.read_csv('data/goals.csv')
goals.head()


Out[3]:
Unnamed: 0 Goals For Goals from Inside Goals from Outside Own Goals For Open Play Goals Set Piece Goals Penalty Goal
0 BRA-GER 8 8 0 0 7 1 0
1 SUI-FRA 7 6 1 0 5 2 0
2 KOR-ALG 6 6 0 0 5 1 0
3 ESP-NED 6 6 0 0 5 1 1
4 AUS-NED 5 4 1 0 4 1 1

5 rows × 8 columns


In [217]:
goals.ix[:, 0][0:5]


Out[217]:
0    BRA-GER
1    SUI-FRA
2    KOR-ALG
3    ESP-NED
4    AUS-NED
Name: Unnamed: 0, dtype: object

In [5]:
goals.describe()


Out[5]:
Goals For Goals from Inside Goals from Outside Own Goals For Open Play Goals Set Piece Goals Penalty Goal
count 55.000000 55.000000 55.000000 55.000000 55.000000 55.000000 55.000000
mean 3.036364 2.545455 0.400000 0.090909 2.509091 0.527273 0.200000
std 1.586560 1.630930 0.564374 0.290129 1.412546 0.716332 0.403687
min 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 2.000000 1.000000 0.000000 0.000000 1.500000 0.000000 0.000000
50% 3.000000 2.000000 0.000000 0.000000 2.000000 0.000000 0.000000
75% 4.000000 3.000000 1.000000 0.000000 3.000000 1.000000 0.000000
max 8.000000 8.000000 2.000000 1.000000 7.000000 3.000000 1.000000

8 rows × 7 columns

Unfortunately this data isn't usable as-is because it combines the stats for both teams! How lame. It looks like we'll have to scrape the data from all ~55 individual games. We can do this by:

  1. Finding the round and match ID for each game from one of FIFA's statistics pages.
  2. Using these IDs to build the statistics page URL for each match.
  3. Scraping the statistics from each match page.
  4. Loading the results into a data structure suited for analysis.

In [218]:
# from bs4 import BeautifulSoup

with open('data/2014 FIFA World Cup Brazil™ - Statistics - Matches - Top attacks - FIFA.com.html', 'r') as f:
    raw_page = f.read()

# big_page_soup = BeautifulSoup(raw_page)

In [219]:
import re

example_url = 'http://www.fifa.com/worldcup/matches/round=255931/match=300186456/index.html#nosticky'
url_re = r'http://www.fifa.com/worldcup/matches/round=(\d{6})/match=(\d{9})/index.html'

game_urls = re.findall(url_re, raw_page)

len(game_urls)


Out[219]:
68

In [220]:
game_urls[:5]


Out[220]:
[('255955', '300186474'),
 ('255955', '300186490'),
 ('255957', '300186502'),
 ('255959', '300186501'),
 ('255957', '300186502')]
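Note that the regex matches contain duplicates: ('255957', '300186502') appears twice in the list above. Keying the final results by round and match ID papers over this, but the duplicate URLs still cost extra requests. A minimal order-preserving dedup sketch, using the five tuples above as sample input:

```python
# Hypothetical sample mirroring the (round_id, match_id) tuples above
game_urls = [('255955', '300186474'),
             ('255955', '300186490'),
             ('255957', '300186502'),
             ('255959', '300186501'),
             ('255957', '300186502')]

# Order-preserving deduplication: keep the first occurrence of each pair
seen = set()
unique_urls = []
for pair in game_urls:
    if pair not in seen:
        seen.add(pair)
        unique_urls.append(pair)

print(len(unique_urls))  # -> 4
```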

In [46]:
# Scrape statistics from each game url

import requests
from bs4 import BeautifulSoup

test_url = 'http://www.fifa.com/worldcup/matches/round=255931/match=300186456/statistics.html'

def extract_stats(url):
    """function to extract all statistics"""
    r = requests.get(url).text
    soup = BeautifulSoup(r)
    
#     print(soup)

    # find teams
    header = soup.find('div', 'navbar-matchheader')
    home = header.find('div', 't home').find('span', 't-nTri').text  # or use 't-nText' for full name
    away = header.find('div', 't away').find('span', 't-nTri').text
    
    # find stats
    stats_tables = soup.find('div', 'general-stats').find_all('div', 'teamstats-wholematch')

    data = {}
    
    # extract data
    for stats in stats_tables:
        # skip empty tables, then pull stats from each row
        if not stats:
            continue
        for row in stats.find_all('tr'):
            try:
                name = row.find('td', 'stats-name').text
                home_stats = row.find('td', attrs={'data-statref': 'home'}).text
                away_stats = row.find('td', attrs={'data-statref': 'away'}).text

                # append data
                data[name] = {home: home_stats, away: away_stats}

            except AttributeError:
                print('error on ' + str(row))

    return data

test_soup = extract_stats(test_url)
test_soup


Out[46]:
{u'Blocked': {u'BRA': u'3', u'CRO': u'1'},
 u'Clearances': {u'BRA': u'7', u'CRO': u'13'},
 u'Dangerous attacks': {u'BRA': u'48', u'CRO': u'38'},
 u'Deliveries in Penalty area': {u'BRA': u'10', u'CRO': u'5'},
 u'Goals': {u'BRA': u'3', u'CRO': u'0'},
 u'Off-Target': {u'BRA': u'5', u'CRO': u'6'},
 u'On-Target': {u'BRA': u'9', u'CRO': u'4'},
 u'Passes Completed': {u'BRA': u'433', u'CRO': u'284'},
 u'Saves': {u'BRA': u'3', u'CRO': u'3'},
 u'Total': {u'BRA': u'14', u'CRO': u'10'},
 u'Total attempts': {u'BRA': u'14', u'CRO': u'10'},
 u'Woodwork': {u'BRA': u'0', u'CRO': u'0'}}

In [240]:
# Loop through all games and extract stats

import time

all_stats = {}

for round_id, match_id in game_urls:
    url = ('http://www.fifa.com/worldcup/matches/round=' + 
           str(round_id) + '/match=' + str(match_id) + '/statistics.html')
    stats_dict = extract_stats(url)
    all_stats['{}:{}'.format(round_id, match_id)] = stats_dict
    time.sleep(0.1)
    
# all_stats

In [48]:
# save stats for later use

import json

with open('data/all_stats.json', 'w') as f:
    json.dump(all_stats, f)

In [222]:
# Reformat the data of interest!
# We care about a goalie's chance of saving an on-target shot,
# i.e., attempts that are on target, excluding blocked shots.

# The other thing is that because we care about goalies, we need to
# switch which team's statistics we track (a goalie faces the opponent's shots)

def get_goalie_stats(game, t1, t2):
    """Extracts savable attacks and saves for the goalie on t1"""
    saves = int(game['Saves'][t2])
    savable_attacks = int(game['On-Target'][t2]) - int(game['Blocked'][t2])
    return savable_attacks, saves

goalie_stats = {}
for ids, game in all_stats.items():
    if 'Goals' in game:
        t1, t2 = list(game['Goals'].keys())

        # accumulate (savable attacks, saves) for each team's goalie
        for team, opponent in ((t1, t2), (t2, t1)):
            savables, saves = get_goalie_stats(game, team, opponent)
            prev_savables, prev_saves = goalie_stats.get(team, (0, 0))
            goalie_stats[team] = (prev_savables + savables, prev_saves + saves)

goalie_stats


Out[222]:
{u'ALG': (30, 23),
 u'ARG': (18, 15),
 u'AUS': (19, 10),
 u'BEL': (16, 13),
 u'BIH': (16, 13),
 u'BRA': (23, 10),
 u'CHI': (22, 18),
 u'CIV': (13, 8),
 u'CMR': (24, 15),
 u'COL': (27, 22),
 u'CRC': (23, 21),
 u'CRO': (11, 5),
 u'ECU': (21, 18),
 u'ENG': (10, 5),
 u'ESP': (14, 7),
 u'FRA': (13, 10),
 u'GER': (29, 24),
 u'GHA': (17, 12),
 u'GRE': (16, 11),
 u'HON': (18, 11),
 u'IRN': (16, 11),
 u'ITA': (15, 12),
 u'JPN': (14, 8),
 u'KOR': (18, 12),
 u'MEX': (13, 10),
 u'NED': (23, 18),
 u'NGA': (25, 21),
 u'POR': (15, 7),
 u'RUS': (13, 10),
 u'SUI': (29, 22),
 u'URU': (16, 10),
 u'USA': (33, 27)}

In [223]:
# Simple save percentages (no prior)

{k:float(saves)/savables for k, (savables, saves) in goalie_stats.items()}


Out[223]:
{u'ALG': 0.7666666666666667,
 u'ARG': 0.8333333333333334,
 u'AUS': 0.5263157894736842,
 u'BEL': 0.8125,
 u'BIH': 0.8125,
 u'BRA': 0.43478260869565216,
 u'CHI': 0.8181818181818182,
 u'CIV': 0.6153846153846154,
 u'CMR': 0.625,
 u'COL': 0.8148148148148148,
 u'CRC': 0.9130434782608695,
 u'CRO': 0.45454545454545453,
 u'ECU': 0.8571428571428571,
 u'ENG': 0.5,
 u'ESP': 0.5,
 u'FRA': 0.7692307692307693,
 u'GER': 0.8275862068965517,
 u'GHA': 0.7058823529411765,
 u'GRE': 0.6875,
 u'HON': 0.6111111111111112,
 u'IRN': 0.6875,
 u'ITA': 0.8,
 u'JPN': 0.5714285714285714,
 u'KOR': 0.6666666666666666,
 u'MEX': 0.7692307692307693,
 u'NED': 0.782608695652174,
 u'NGA': 0.84,
 u'POR': 0.4666666666666667,
 u'RUS': 0.7692307692307693,
 u'SUI': 0.7586206896551724,
 u'URU': 0.625,
 u'USA': 0.8181818181818182}

In [225]:
# number of non-saves
# {k:savables - saves for k, (savables, saves) in goalie_stats.items()}

Bayesian Statistics!

Here I will model a goalie's probability of making a save as a Bernoulli random variable with a prior given by the beta distribution.

First, I will set the prior by choosing the maximum likelihood estimates of the beta distribution's parameters using all of the data. Modeling many binary outcomes with a beta conjugate prior means I am working with the beta-binomial distribution.

Then, I will look at each goalie's performance by using their results to calculate a posterior probability that they will save an on-target, non-blocked attempt.
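The conjugate update behind this is simple: with a Beta(a, b) prior and s saves out of n savable attempts, the posterior is Beta(a + s, b + n - s). A minimal sketch of the update (the numbers are illustrative placeholders, not the fitted prior):

```python
def posterior_params(a, b, savables, saves):
    """Beta-binomial conjugate update: a Beta(a, b) prior plus
    `saves` successes in `savables` trials gives a Beta posterior."""
    return a + saves, b + (savables - saves)

def beta_mean(a, b):
    """Expected value of a Beta(a, b) distribution."""
    return a / (a + b)

# Illustrative prior and data (not the fitted values below)
a_post, b_post = posterior_params(2.0, 1.0, savables=10, saves=8)
print('{} {}'.format(a_post, b_post))   # 10.0 3.0
print(beta_mean(a_post, b_post))        # 10/13, about 0.769
```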


In [65]:
total_savables = float(sum([a for a, b in goalie_stats.values()]))
total_savables


Out[65]:
610.0

In [66]:
total_saves = float(sum([b for a, b in goalie_stats.values()]))
total_saves


Out[66]:
439.0

In [226]:
# I expect the prior to have an expected mean close to the fraction of
# total saves divided by total savables

total_frac = total_saves/total_savables
total_frac


Out[226]:
0.719672131147541

In [179]:
# First, I will form the negative log-likelihood for the beta-binomial
# distribution. Luckily it has a closed-form expression

from scipy.special import betaln, gammaln, comb

def nllf(a, b):
    result = 0
    for savables, saves in goalie_stats.values():
#         result += (log(comb(savables, saves)) + betaln(saves + a, savables - saves + b) - betaln(a, b))
        # gammaln is more numerically stable than log(comb(...)) for large counts
        result += (gammaln(savables + 1) - gammaln(saves + 1) - gammaln(savables - saves + 1) + 
                   betaln(saves + a, savables - saves + b) - betaln(a, b))  
    return -result

# test
nllf(total_frac, 1.0-total_frac)


Out[179]:
105.25476226621079

In [180]:
nllf(total_saves, total_savables)


Out[180]:
185.09959480676741

In [181]:
nllf(0.5, 0.5)


Out[181]:
103.73915109562266

In [190]:
# Now we'll estimate a and b by minimizing the negative log-likelihood

from scipy.optimize import minimize

def nllf_log_args(arg_array):
    return nllf(exp(arg_array[0]), exp(arg_array[1]))

result = minimize(nllf_log_args, [log(0.5), log(0.5)], tol=10**-3)

result


Out[190]:
   status: 0
  success: True
     njev: 12
     nfev: 48
 hess_inv: array([[ 0.48088923,  0.46077401],
       [ 0.46077401,  0.45425467]])
      fun: 72.39193662170837
        x: array([ 3.16494542,  2.25544616])
  message: 'Optimization terminated successfully.'
      jac: array([  6.77108765e-05,  -1.70707703e-04])

In [227]:
# Calculate the prior parameters in more usable forms

a, b = exp(result.x)
a


Out[227]:
23.687451098207482

In [228]:
b


Out[228]:
9.5395484971122819

In [229]:
# Calculate the expected value of the prior

prior_mean = a/(a+b)
prior_mean


Out[229]:
0.71289768521693453
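A useful way to read the fitted prior: the posterior mean is a weighted average of the prior mean and a team's observed save rate, with the prior worth roughly a + b ≈ 33 pseudo-attempts. A sketch using the USA's totals from goalie_stats (33 savable attacks, 27 saves) and the fitted a and b above as assumed inputs:

```python
a, b = 23.687451098207482, 9.5395484971122819  # fitted prior parameters from above
savables, saves = 33.0, 27.0                   # USA totals from goalie_stats

prior_mean = a / (a + b)
raw_rate = saves / savables

# Posterior mean as a blend of the data and the prior, weighted by sample size
weight_data = savables / (savables + a + b)
posterior_mean = weight_data * raw_rate + (1 - weight_data) * prior_mean

# Algebraically identical to the direct conjugate form
direct = (a + saves) / (a + b + savables)
print('{:.6f} {:.6f}'.format(posterior_mean, direct))  # both about 0.765
```

The raw save rate of 0.818 gets shrunk toward the prior mean of 0.713, since 33 real attempts only slightly outweigh the ~33 pseudo-attempts in the prior.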

In [204]:
# Plot the pdf of the prior distribution
# Make a function for repeat use

from scipy.stats import beta

def bayes_goalie_chart(posterior_1_country, posterior_2_country=None, chart_title=None):
    x = arange(0,1,0.01)
    
    if posterior_1_country:
        plot(x, beta(*posteriors[posterior_1_country]).pdf(x),
             label="{} Posterior".format(posterior_1_country))
    
    if posterior_2_country is not None:
        plot(x, beta(*posteriors[posterior_2_country]).pdf(x),
             label="{} Posterior".format(posterior_2_country))
    else:
        plot(x, beta(a, b).pdf(x), label="Prior")
    
    # Other nice formatting stuff
    title(chart_title)
    legend(loc="best")
    xlabel("Probability")
    ylabel("Density")
    show()
    
bayes_goalie_chart(None, chart_title="Probability of a Save")



In [122]:
# Calculate posterior distributions

posteriors = {k:(a + saves, b + savables - saves) for k, (savables, saves) in goalie_stats.items()}
posteriors


Out[122]:
{u'ALG': (46.688210323459558, 16.539863916112218),
 u'ARG': (38.688210323459558, 12.539863916112218),
 u'AUS': (33.688210323459558, 18.539863916112218),
 u'BEL': (36.688210323459558, 12.539863916112218),
 u'BIH': (36.688210323459558, 12.539863916112218),
 u'BRA': (33.688210323459558, 22.539863916112218),
 u'CHI': (41.688210323459558, 13.539863916112218),
 u'CIV': (31.688210323459558, 14.539863916112218),
 u'CMR': (38.688210323459558, 18.539863916112218),
 u'COL': (45.688210323459558, 14.539863916112218),
 u'CRC': (44.688210323459558, 11.539863916112218),
 u'CRO': (28.688210323459558, 15.539863916112218),
 u'ECU': (41.688210323459558, 12.539863916112218),
 u'ENG': (28.688210323459558, 14.539863916112218),
 u'ESP': (30.688210323459558, 16.539863916112218),
 u'FRA': (33.688210323459558, 12.539863916112218),
 u'GER': (47.688210323459558, 14.539863916112218),
 u'GHA': (35.688210323459558, 14.539863916112218),
 u'GRE': (34.688210323459558, 14.539863916112218),
 u'HON': (34.688210323459558, 16.539863916112218),
 u'IRN': (34.688210323459558, 14.539863916112218),
 u'ITA': (35.688210323459558, 12.539863916112218),
 u'JPN': (31.688210323459558, 15.539863916112218),
 u'KOR': (35.688210323459558, 15.539863916112218),
 u'MEX': (33.688210323459558, 12.539863916112218),
 u'NED': (41.688210323459558, 14.539863916112218),
 u'NGA': (44.688210323459558, 13.539863916112218),
 u'POR': (30.688210323459558, 17.539863916112218),
 u'RUS': (33.688210323459558, 12.539863916112218),
 u'SUI': (45.688210323459558, 16.539863916112218),
 u'URU': (33.688210323459558, 15.539863916112218),
 u'USA': (50.688210323459558, 15.539863916112218)}

In [124]:
posterior_means = {k:saves/(saves + misses) for k, (saves, misses) in posteriors.items()}
posterior_means


Out[124]:
{u'ALG': 0.73840949427871994,
 u'ARG': 0.75521500461897817,
 u'AUS': 0.64502110816743319,
 u'BEL': 0.74527006977591448,
 u'BIH': 0.74527006977591448,
 u'BRA': 0.59913505449117299,
 u'CHI': 0.75483729783192954,
 u'CIV': 0.68547545717000857,
 u'CMR': 0.67603550945119228,
 u'COL': 0.75858660434208169,
 u'CRC': 0.7947668656240271,
 u'CRO': 0.6486425379514178,
 u'ECU': 0.76875697520231101,
 u'ENG': 0.66364766018648691,
 u'ESP': 0.64978745836192309,
 u'FRA': 0.72873921048223234,
 u'GER': 0.76634559089623733,
 u'GHA': 0.71052316585418474,
 u'GRE': 0.70464284576006408,
 u'HON': 0.6771328190327367,
 u'IRN': 0.70464284576006408,
 u'ITA': 0.73998829283913037,
 u'JPN': 0.67096130497966455,
 u'KOR': 0.69665336542929712,
 u'MEX': 0.72873921048223234,
 u'NED': 0.74141273531506691,
 u'NGA': 0.76746845756216797,
 u'POR': 0.63631423827989952,
 u'RUS': 0.72873921048223234,
 u'SUI': 0.73420575651376552,
 u'URU': 0.68432923375213883,
 u'USA': 0.76535836056626516}

In [215]:
# Calculate the 'rank' of how good a goalie is

ranked_posterior_means = sorted(posterior_means.values(), reverse=True)
posterior_rank = {k: ranked_posterior_means.index(v) + 1 for k, v in posterior_means.items()}
posterior_rank


Out[215]:
{u'ALG': 13,
 u'ARG': 7,
 u'AUS': 30,
 u'BEL': 9,
 u'BIH': 9,
 u'BRA': 32,
 u'CHI': 8,
 u'CIV': 22,
 u'CMR': 25,
 u'COL': 6,
 u'CRC': 1,
 u'CRO': 29,
 u'ECU': 2,
 u'ENG': 27,
 u'ESP': 28,
 u'FRA': 15,
 u'GER': 4,
 u'GHA': 18,
 u'GRE': 19,
 u'HON': 24,
 u'IRN': 19,
 u'ITA': 12,
 u'JPN': 26,
 u'KOR': 21,
 u'MEX': 15,
 u'NED': 11,
 u'NGA': 3,
 u'POR': 31,
 u'RUS': 15,
 u'SUI': 14,
 u'URU': 23,
 u'USA': 5}
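Ranks alone hide how much these posteriors overlap. As a quick check (not part of the original analysis), here is a sketch of approximate 95% credible intervals using a normal approximation to the Beta posterior, with the USA and CRC parameters copied from the posteriors output above:

```python
import math

# Posterior parameters copied from the posteriors output above
posteriors = {'USA': (50.688210323459558, 15.539863916112218),
              'CRC': (44.688210323459558, 11.539863916112218)}

def approx_credible_interval(a, b, z=1.96):
    """Approximate equal-tailed 95% interval for Beta(a, b) via a
    normal approximation (reasonable here since a and b are both large)."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    half = z * math.sqrt(var)
    return mean - half, mean + half

for country, (a_c, b_c) in sorted(posteriors.items()):
    lo, hi = approx_credible_interval(a_c, b_c)
    print('{}: [{:.3f}, {:.3f}]'.format(country, lo, hi))
```

The two intervals overlap substantially, so the gap between the top-ranked goalies is well within the posterior uncertainty.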

Aww, so it turns out that according to this analysis the USA is not #1. Tim Howard was good, but it's hard to come out on top of Costa Rica, which allowed only two goals.

Let's look at a handful of countries to get a feel for the data. Why not start with Brazil, which seemed to have the worst goalkeeping performance in the World Cup?


In [207]:
bayes_goalie_chart('BRA', chart_title="Brazil's Posterior Probability of a Save")


And how does the USA's posterior compare to the prior?


In [237]:
bayes_goalie_chart('USA', chart_title="USA's Posterior Probability of a Save")


And what does it look like when we compare the two?


In [238]:
bayes_goalie_chart('USA', posterior_2_country='BRA', chart_title="USA vs BRA")


How does the USA posterior compare to Costa Rica's?


In [210]:
bayes_goalie_chart('USA', posterior_2_country='CRC', chart_title="USA vs CRC")


Pretty similar.

Since I did this analysis in the Netherlands, I thought I'd also compare it to the Netherlands' performance.


In [211]:
bayes_goalie_chart('USA', posterior_2_country='NED', chart_title="USA vs NED")


And how different are all of the goalies from each other?


In [235]:
# Use fresh names so we don't clobber the prior's a and b
for a_post, b_post in posteriors.values():
    x = arange(0, 1, 0.01)
    plot(x, beta(a_post, b_post).pdf(x), color='b', alpha=0.25)

title('All Posteriors')
show()


Again, pretty similar. Soccer seems to have pretty random outcomes to me.

Further work

  • Differentiate between different types of attacks (but where's the data?)
  • Other ways to estimate the prior

In [ ]: